Progress Memo 1

Final Project
Data Science 1 with R (STAT 301-1)

Author

Chelsea Nelson

Published

December 6, 2023

Github Repo Link

Data source

I will be using the Economic Policy Institute’s data on family budget, which also then corresponding to telling us about the cost of living in each county in America. This dataset provides insights into the average economical weights of different aspects of life for each county in America both annually and monthly, whiles dividing these averages further by also looking at the different family types as well. I have provided an additional copy of the same dataset, with less columns and information from Kaggle. This in case that the analysis that in which I wish to conduct starts to not follow the overall larger EPI dataset. Thus turning towards the smaller one from Kaggle where I can focus in more on each variable, looking only at the annual costs and not both monthly and annually as in the larger EPI provided dataset. I additionally wanted to join this dataset with others that provide information on the minimum wage in each state and perhaps the racial majorities of each county, but I was unable to find suitable datasets, thus I will be adding these variables on myself during the first stages of my EDA. 1

Why this data

The reason I choose this dataset is because I thought it would be interesting to see how the cost of living differs not only on the level of state vs state, but looking further into the issue by seeing how cost of living differs by family size and county location. Additionally, I think this would be an opportunity to learn how the incomes levels and expenses in each county and state differ from each other, while also starting to understand why these differences are present. I further think that by looking additionally at minimum wage in each state, it will bring attention to a lot of the inequity present in the states. Thinking on this now, I could also create a variable which states what the majority race or ethnicity in each county is as well, I think this would be an interesting statistical insight to find and correlate to expense levels in the different counties and states.

Data quality & complexity check

Family Budget Dataset

Family Budget Dataset Codebook

Cost of Living Dataset

Skimming Datasets

Family Budget Dataset

Data summary
Name Piped data
Number of rows 31430
Number of columns 27
_______________________
Column type frequency:
character 4
numeric 23
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state_abv 0 1 2 2 0 51 0
areaname 0 1 12 183 0 2561 0
county 0 1 10 33 0 1877 0
family 0 1 4 4 0 10 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
case_id 0 1 1589.31 917.22 1.00 792.00 1593.00 2386.00 3171.00 ▇▇▇▇▇
metro 0 1 0.37 0.48 0.00 0.00 0.00 1.00 1.00 ▇▁▁▁▅
housing_monthly 0 1 922.81 347.13 350.78 715.00 868.00 1037.00 5144.63 ▇▁▁▁▁
food_monthly 0 1 690.63 272.60 185.02 483.45 677.43 891.97 2598.22 ▇▇▁▁▁
transportation_monthly 0 1 1132.82 136.70 184.71 1044.60 1141.51 1230.48 1651.37 ▁▁▃▇▁
healthcare_monthly 0 1 1116.17 433.71 289.70 805.62 1090.22 1388.15 3104.36 ▅▇▃▁▁
other_necessities_monthly 0 1 584.61 199.78 217.64 440.53 561.09 701.09 2402.45 ▇▅▁▁▁
childcare_monthly 0 1 823.30 564.85 0.00 445.14 847.20 1189.70 4069.26 ▇▇▁▁▁
taxes_monthly 0 1 638.14 278.32 85.65 466.50 574.87 732.52 3979.45 ▇▁▁▁▁
total_monthly 0 1 5908.47 1820.55 2507.31 4481.33 5914.81 7114.28 18643.13 ▇▇▁▁▁
housing_annual 0 1 11073.67 4165.61 4209.31 8580.00 10416.00 12444.00 61735.59 ▇▁▁▁▁
food_annual 0 1 8287.50 3271.14 2220.28 5801.42 8129.16 10703.62 31178.62 ▇▇▁▁▁
transportation_annual 0 1 13593.86 1640.46 2216.46 12535.16 13698.16 14765.76 19816.48 ▁▁▃▇▁
healthcare_annual 0 1 13394.03 5204.55 3476.38 9667.44 13082.70 16657.82 37252.27 ▅▇▃▁▁
other_necessities_annual 0 1 7015.32 2397.42 2611.64 5286.35 6733.06 8413.09 28829.44 ▇▅▁▁▁
childcare_annual 0 1 9879.58 6778.22 0.00 5341.62 10166.34 14276.38 48831.09 ▇▇▁▁▁
taxes_annual 0 1 7657.71 3339.80 1027.80 5597.97 6898.47 8790.21 47753.39 ▇▁▁▁▁
total_annual 0 1 70901.68 21846.55 30087.66 53776.02 70977.68 85371.34 223717.55 ▇▇▁▁▁
median_family_income 10 1 68316.00 16886.97 25529.98 57223.99 65955.61 76136.07 177662.47 ▂▇▁▁▁
num_counties_in_st 10 1 96.51 56.88 1.00 64.00 87.00 105.00 254.00 ▂▇▂▁▁
st_cost_rank 10 1 48.77 43.05 1.00 18.00 39.00 66.00 254.00 ▇▃▁▁▁
st_med_aff_rank 10 1 48.74 43.08 1.00 18.00 39.00 66.00 254.00 ▇▃▁▁▁
st_income_rank 10 1 48.77 43.05 1.00 18.00 39.00 66.00 254.00 ▇▃▁▁▁

Cost of Living Dataset

Data summary
Name Piped data
Number of rows 31430
Number of columns 15
_______________________
Column type frequency:
character 4
logical 1
numeric 10
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
state 0 1 2 2 0 51 0
areaname 0 1 12 183 0 2561 0
county 0 1 10 33 0 1877 0
family_member_count 0 1 4 4 0 10 0

Variable type: logical

skim_variable n_missing complete_rate mean count
isMetro 0 1 0.37 FAL: 19730, TRU: 11700

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
case_id 0 1 1589.31 917.22 1.00 792.00 1593.00 2386.00 3171.00 ▇▇▇▇▇
housing_cost 0 1 11073.67 4165.61 4209.31 8580.00 10416.00 12444.00 61735.59 ▇▁▁▁▁
food_cost 0 1 8287.50 3271.14 2220.28 5801.42 8129.16 10703.62 31178.62 ▇▇▁▁▁
transportation_cost 0 1 13593.86 1640.46 2216.46 12535.16 13698.16 14765.76 19816.48 ▁▁▃▇▁
healthcare_cost 0 1 13394.03 5204.55 3476.38 9667.44 13082.70 16657.82 37252.27 ▅▇▃▁▁
other_necessities_cost 0 1 7015.32 2397.42 2611.64 5286.35 6733.06 8413.09 28829.44 ▇▅▁▁▁
childcare_cost 0 1 9879.58 6778.22 0.00 5341.62 10166.34 14276.38 48831.09 ▇▇▁▁▁
taxes 0 1 7657.71 3339.80 1027.80 5597.97 6898.47 8790.21 47753.39 ▇▁▁▁▁
total_cost 0 1 70901.68 21846.55 30087.66 53776.02 70977.68 85371.34 223717.55 ▇▇▁▁▁
median_family_income 10 1 68316.00 16886.97 25529.98 57223.99 65955.61 76136.07 177662.47 ▂▇▁▁▁

The fbc_data there are 27 variables and 31,430 observations. Within this, there are twenty-three numerical variables and four categorical variables. However, I want to change the metro variable from being a numerical variable with only 0 and 1s to that of a logical later on. Five variables, which all pertain to different rankings and extra information, share the same amount of missing data. However, I feel that since it is such a small percentage, with each missing only 10 values (0.3181) respectively of their portion of the data, I will see how I can adapt or change my analysis to either ignore or get rid of this missingness altogether.

Looking at the smaller version of the data, which looks only at the annual costs has 15 variables and 31430 observations. Within this, there are five categorical variables and ten numerical variables. There is only one variable which has missing values, being the median_family_income variable. It is missing ten instances (0,3181 of the data), thus since it is a small number I will see how I can operate around this missingness. This variable was also one of the five missing in the larger dataset, thus this helps to prove that this dataset is the same as the larger fbc_data dataset, but just showcasing specific variables.

Potential data issues

I think the main issues that I will face is being able to figure out how to navigate looking at all the different levels that my data is assessing, relating to the different family types, different counties, different states, and how they differ between annual and monthly cost.

Additionally, I feel that I might struggle slightly with wanting to add in my own variables of minimum wage and the racial composition or racial majority of each county into my dataset and analysis.

Furthermore, I have to look more into where the missing data for the variables lie, and how I can navigate around them in order to still provide similar and consistent comparisons for each county and family type.

Misc

As I have been discussing above, I plan to add other information such as minimum wage for each state and the ethnic makeup or majority in each county, if I am able to find this for each or most counties in America. I plan to add the minimum wage through using the case when function, however I am unsure of how I want to add the ethnic makeup variable to my dataset. Currently, I think I will make my own dataset in excel, and then join it together with my current dataset.

Furthermore, I currently feel that I want to look at how minimum wage of each state corresponds with the annual cost in each county and how being in a metro area makes this either higher or lower.

Thinking about this, I was wondering if I should look to see if different metro areas have different minimum wages than the state as a whole as well.

Footnotes

  1. These datasets were sourced from EPI Family Budget Map & Kaggle US Cost of Living Dataset.↩︎